Abstract

Background: Mutations as sources of evolution have long been the focus of attention in the biomedical literature. 
Accessing the mutational information and their impacts on protein properties facilitates research in various 
domains, such as enzymology and pharmacology. However, manually curating the rich and fast growing repository 
of biomedical literature is expensive and time-consuming. As a solution, text mining approaches have increasingly 
been deployed in the biomedical domain. While the detection of single-point mutations is well covered by 
existing systems, challenges still exist in grounding impacts to their respective mutations and recognizing the 
affected protein properties, in particular kinetic and stability properties together with physical quantities. 

Results: We present an ontology model for mutation impacts, together with a comprehensive text mining system 
for extracting and analysing mutation impact information from full-text articles. Organisms, as sources of proteins, 
are extracted to help disambiguation of genes and proteins. Our system then detects mutation series to correctly 
ground detected impacts using novel heuristics. It also extracts the affected protein properties, in particular kinetic 
and stability properties, as well as the magnitude of the effects and validates these relations against the domain 
ontology. The output of our system can be provided in various formats, in particular by populating an OWL-DL 
ontology, which can then be queried to provide structured information. The performance of the system is 
evaluated on our manually annotated corpora. In the impact detection task, our system achieves a precision of 
70.4%-71.1%, a recall of 71.3%-71.5%, and grounds the detected impacts with an accuracy of 76.5%-77%. The 
developed system, including resources, evaluation data and end-user and developer documentation is freely 
available under an open source license at http://www.semanticsoftware.info/open-mutation-miner. 

Conclusion: We present Open Mutation Miner (OMM), the first comprehensive, fully open-source approach to 
automatically extract impacts and related relevant information from the biomedical literature. We assessed the 
performance of our work on manually annotated corpora and the results show the reliability of our approach. The 
representation of the extracted information into a structured format facilitates knowledge management and aids in 
database curation and correction. Furthermore, access to the analysis results is provided through multiple 
interfaces, including web services for automated data integration and desktop-based solutions for end user 
interactions. 



Background 

A vast amount of research is dedicated to the identification 
of mutations and their impacts. Biologists usually 
make inferences about functions of novel sequences by 
comparing them to the functions of known sequences 
[1]. Many mutagenesis experiments are performed to 
improve the properties of proteins, particularly enzymes. 
Additionally, the detection of disease-causing mutations 
has attracted a lot of attention. The result of all these efforts 
lies in publications, particularly in textual format. Conse- 
quently, locating and retrieving this information is a very 
cumbersome task. Some databases try to manually curate 
such information and provide it in publicly accessible 
form. However, even for an expert curator, extracting this 
information manually is laborious. Hence, database cura- 
tors now increasingly reach for text-mining procedures. 

Large-scale attempts resulted in high levels of perfor- 
mance in the realization of the automatic extraction of 
mutations [2-10]. Yet, finding their impacts, affected 
protein properties and magnitudes of effects remains 
challenging. 

MEMA [2] uses regular expressions to extract mutations 
and mutation-gene pairs. It focuses on the co-occurrence 
of mutations and genes within a sentence and proximity 
parameters within an abstract. The performance of the 
system is evaluated on a set of 100 abstracts. The reported 
recall and precision for the mutation detection task are 
>67% and >96%, respectively. 

MuteXt [3] searches for mutation data using a pattern 
matching approach and further validates the extracted 
point mutations using two plausibility filters: a sequence 
filter and a distance filter. The performance of the system 
is evaluated on two corpora. Their algorithm detects 
49.3%-64.5% of point mutations with a specificity of 
85.8%-87.9%. 

Mutation GraB [4] takes a dictionary-based approach 
to identify protein and gene names while extracting point 
mutation terms using regular expressions, utilizing graph 
bigrams to disambiguate the extracted protein point 
mutations. The authors evaluate the effectiveness of their 
approach on the articles describing three protein families, 
namely, tyrosine protein kinases, GPCRs and transmem- 
brane ion. 

MuGex [5] uses 12 regular expressions to detect muta- 
tions and statistical techniques to disambiguate between 
protein mutations and nucleotide mutations or cell lines. 
Gene-mutation pairs are detected through proximity 
measures. 

The MutationFinder system [6] extends MuteXt [3]'s 
rules to extract and normalize point mutations. 

In recent work [7], the authors present a strategy to 
integrate information about phenotypic effect of SNPs 
from UniProtKB and pathways from Reactome and BioPAX 
for visualization in Cytoscape. 

Yip et al. [8] use four regular expressions to extract and 
retrieve single amino acid polymorphisms (SAPs). The 
system is assessed on a Swiss-Prot corpus with 9820 
PubMed references. Additionally, each pattern is evalu- 
ated separately. 

The mSTRAP (Mutation extraction and STRucture 
Annotation Pipeline) system [9] was developed with the 
aim of annotating mutations and representing them as 
instances of an ontology. They further use mSTRAPviz 
to read the populated ontology and visualize the annota- 
tions on protein structures. 

EnzyMiner [10] tries to categorize PubMed abstracts 
based on the impact of a protein level mutation on the 



stability and activity of a given enzyme. Using different 
classification algorithms, EnzyMiner is able to narrow 
down search results; however, detailed information 
about the direction of the impacts, association of 
impacts to mutations and the kind of change in stability 
or functionality is not provided. Although EnzyMiner 
targets mutation impact information, it differs signifi- 
cantly from our approach, as we are concerned with 
sentence-level detection and semantic analysis of muta- 
tion impacts, not document classification. 

In [11], the authors introduced the first rule-based 
approach to extract mutation impacts on protein proper- 
ties while categorizing the directionality of the impacts and 
grounding the impacts to the mutations. The extracted 
information was populated to a domain ontology for 
further querying through a web service. While in the aforementioned 
work, molecular properties and the Michaelis 
constant (Km), the rate constant (Kcat) and the compound 
variable (Kcat/Km) are considered, the other protein properties, 
such as the remaining kinetic constants and protein 
stability, are ignored. On the corpus of 13 documents on 
haloalkane dehalogenase, the authors report a recall of 34% 
and a precision of 86% for the mutation-impact relation 
extraction task. 

A recent work on the extraction of kinetic information 
and associated information, namely, enzyme names, EC 
numbers and localization is presented in [12]. The pro- 
posed rule- and dictionary-based approach in this system 
is applied to PubMed abstracts and the results are pro- 
vided in KID, the Kinetic Database [13]. 

KiPar [14], an information retrieval system, focuses on 
kinetic modeling of metabolic pathways using a rule- 
based approach. 

However, all the existing approaches are unable to 
extract the protein properties affected by mutations. In 
this paper, we present a rule-based approach to extract 
mutation series, modified protein properties and magni- 
tudes of effects [15,16]. In our system, the relation 
between the magnitudes of effects and the protein proper- 
ties are detected and validated against the domain ontol- 
ogy. To provide for effective querying and analysis, we 
populate a domain ontology with the extracted informa- 
tion. Table 1 summarizes the scope of our Open Mutation 
Miner (OMM) system, compared to existing approaches. 
Further details on these tasks are provided in the following 
section. 

Methods 

In order to comprehensively extract mutation impacts, the 
detection of several named entities and their relations, in 
particular mutations and protein properties, is required. As 
an example, consider the following text segment (format- 
ting used: bold face: Mutation; underlined : Impact expres- 
sion; underlined non-italics : Protein property; underlined 



Table 1 Literature mining approaches for mutations and impacts 



MEMA [2], MuteXt [3], Mutation GraB [4], MuGex [5], mSTRAP [9], Mutation Miner [20], MutationFinder [6], Yip et al. [8], Mehren et al. [7], Laurila et al. [11], OMM 



Mutation Tagging VVVVVV V V V V 

Mutation Series Tagging (V) V 

Mutation-Protein Grounding V V V V V V 

Impact Tagging V V 

Impact-Mutation Grounding V V 

Protein Property Tagging (V) V 

Physical Quantity Tagging V 

Impact-Protein Property Grounding V 

Protein Property-Physical Quantity Grounding V 

Visualization V V V 

Ontology Export V (V) V 

Web Service Access (V) V 
bold : Physical quantity) [17]: "Several single mutants 
(Q15K, Q15R, W37K, and W37R), double mutants (Q15K-W37K, 
Q15K-W37R, Q15R-W37K, and Q15R-W37R), and 
triple mutants (Q15K-D36A-W37R and Q15K-D36S-W37R) 
were prepared and expressed as glutathione S-transferase 
(GST) fusion proteins in Escherichia coli and 
purified by GSH-agarose affinity chromatography. Mutant 
Q15K-W37R and mutant Q15R-W37R showed comparable 
activity for NAD and NADP with an increase in activity 
nearly 3-fold over that of the wild type." 

In this example, we need to extract increase as an 
impact that is caused "comparably" by two mutation 
pairs, Q15K-W37R and Q15R-W37R. In other words, the 
two aforementioned mutations have the same impact on 
the activity of an enzyme, glutathione S-transferase 
(GST), that is residing in the host organism, Escherichia 
coli. We are also interested to know that activity as a 
kinetic property of the mutant enzyme is measured 3-fold 
higher than the activity of the wild-type enzyme. Note 
that other entities, such as single mutations (Q15K, 
Q15R, W37K, and W37R), exist in the text segment, but 
here we are only interested in the entities that are related 
to the identified impact. The result of the system should 
be a set of detected entities, correctly normalized and 
grounded, and linked with each other. 

After detecting organism mentions, which is handled by 
a separate module, the OrganismTagger [18], the first step 
of impact analysis is to detect impact mentions. However, 
extracting only impacts is not sufficient; we want to know 
which mutation caused the impact. Hence, the system 
needs to ground the detected impacts to mutations. Addi- 
tionally, mutations can appear in the form of mutation 



series (see the above example). Thus, the system must also 
be able to identify these complex mutation expressions. 
Finding out which protein properties were affected by the 
mutations and to what extent is necessary to identify 
advantageous mutations. Towards this end, we export the 
analysis results into an ontology (so-called ontology popu- 
lation [19]) for further applications, including queries and 
summarization. An overview of our system is presented in 
Figure 1. In what follows, we will provide a detailed 
description of each task. 

Impact ontology 

Our Impact Ontology is an extension to the ontology 
described in [20], conceptualizing impacts and the muta- 
tions associated with them (Figure 2). The use of the 
impact ontology facilitates advanced queries and impact 
extraction. The ontology contains information about sev- 
eral concepts: Text elements, biological entities and entity 
relations, e.g., Sentence, MutationImpact and measuredWith, 
respectively. We extended the ontology with new 
classes, such as MichaelisMentenConstant, SpecificActivity, 
and MaximalVelocity. Our ontology has a rich set of rela- 
tionships between the concepts. Main concepts modeling 
impacts on a semantic level are: 

Mutation: An alteration of or change to a gene that gives 
rise to a different offspring. 

UnitOfMeasurement: A standard for measuring the 
physical quantity. 

MutationImpact: The expansion of an impact can be 
presented as a bifurcating tree: each bifurcating node 
represents a mutation effect on protein properties, 
whether the impact is measurable or not. 



Figure 1 Open Mutation Miner (OMM) System Overview. Input documents are processed through a text mining pipeline implemented in 
GATE, which (1) performs preprocessing; (2) detects mutation mentions; (3) detects protein properties; and (4) detects impact mentions and links 
the detected entities. Results can be exported in various formats, in particular by populating the OMM ontology. 
Figure 2 OMM impact ontology. Visualization of the main concepts in the OMM Impact Ontology, which formally describes the domain of 
mutation impact analysis using the Web Ontology Language (OWL). 



ProteinProperty: A class for protein properties, which 
subsumes kinetic properties, protein function, and protein 
stability. 

Information about the effect of mutations on proteins 
can be modeled at different granularity levels. For exam- 
ple, the effect can be on the structure, which conse- 
quently can affect various properties of the proteins. For 
a finer level of granularity, we represent all these rela- 
tions. The relations between these entities, expressed as 
OWL object properties, are listed in Table 2. 

Each protein property is measured with specific units of 
measurement, for example Michaelis Menten Constant is 
measured with units such as per second, per minute, etc. 
However, in interpreting the mutation impacts, not only 
are these units of measurement utilized, but also ratio 
measurements can be used. For example, the measured 
values of the affected protein property are compared with 
the measured values of the wild type or other mutated 
protein properties, and specified by percent, fold or orders 
of magnitude. We decided to establish some restrictions 
on the units of measurement with which each protein 
property is measured, as well as the ratio measurement 
units. These restrictions are encoded in the ontology 
based on global standards (SI [21]), where protein proper- 
ties are measured by specific units of measurements. 



These constraints are encoded as possible value fillers for 
the measuredWith slot for a specific protein property. For 
instance, Km can be measured with fold, per second and 
per minute, etc. We also defined a datatype property for 
protein properties, called physicalQuantity, referring to 
the value and the unit of measurement found in the text. 
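
The measuredWith restriction can be pictured as a lookup from each protein property to its permitted units. The following is a minimal sketch only: it assumes a hand-written Python dictionary standing in for the OWL ontology, the single entry is taken from the Km example above, and `is_valid_measurement` is a hypothetical helper name rather than part of the described system.

```python
# Hypothetical stand-in for the measuredWith restrictions of the ontology.
# The entry below mirrors the Km example in the text; it is illustrative only.
MEASURED_WITH = {
    "MichaelisMentenConstant": {"fold", "per second", "per minute"},
}


def is_valid_measurement(protein_property: str, unit: str) -> bool:
    """Return True if the unit is an allowed filler for the given property."""
    allowed = MEASURED_WITH.get(protein_property, set())
    return unit.lower() in allowed
```

A property-unit pair found in the text would be kept only if this check succeeds; properties missing from the table are rejected by default in this sketch.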

Mutation extraction component 

Single point mutations can be expressed in single-letter 
standard format or through more complex representations. 
We integrated one external mutation detection system and 
also developed our own approach. 
MutationTagger 

Our MutationTagger, based on previous work [20], 
extracts single point mutations using grammar rules and 
normalizes them to their single-letter format. 

However, mutational mentions in the form of natural 
language, such as "Met for Val substitution found at posi- 
tion 270" are currently ignored by our system. 
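
To illustrate what detecting and normalizing a point mutation involves, the sketch below uses a single regular expression and a three-letter to one-letter lookup. It is an assumption-laden approximation, not the MutationTagger grammar: the pattern, the function name `find_point_mutations` and the restriction to the 20 standard amino acids are all choices made for this example. Like the system described above, it ignores natural-language mentions.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

_ONE = "ACDEFGHIKLMNPQRSTVWY"
_THREE = "|".join(THREE_TO_ONE)

# Wild-type residue, position, mutant residue; either one-letter or three-letter.
MUTATION_RE = re.compile(
    rf"\b(?:([{_ONE}])(\d+)([{_ONE}])|({_THREE})(\d+)({_THREE}))\b"
)


def find_point_mutations(text):
    """Return all point mutations in text, normalized to single-letter format."""
    mutations = []
    for m in MUTATION_RE.finditer(text):
        if m.group(1):  # already in single-letter format
            wt, pos, mt = m.group(1), m.group(2), m.group(3)
        else:  # three-letter format, convert both residues
            wt = THREE_TO_ONE[m.group(4)]
            pos = m.group(5)
            mt = THREE_TO_ONE[m.group(6)]
        mutations.append(f"{wt}{pos}{mt}")
    return mutations
```

For example, `find_point_mutations("Ser415Cys and the S231K variant")` yields `["S415C", "S231K"]`, while a bare residue mention such as "Asn107" is not reported.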
MutationFinder 

The MutationFinder system [6] accomplishes the task of 
single mutation detection and normalization by using reg- 
ular expressions. MutationFinder also tries to identify 
mutational changes expressed in natural language. How- 
ever, it still fails at extracting all mentions. 



Table 2 Mutation impact concepts in the Open Mutation Miner ontology 

Object Property        Domain             Range              Description 
hasProperty            Protein            ProteinProperty    Which protein the protein property belongs to 
impactOn               MutImpact          ProteinProperty    Identifies the protein property affected by a mutation 
measuredWith           ProteinProperty    UnitOfMeasurement  Holds between protein property and the corresponding unit of measurement 
mutationMutImpactRel   Mutation           MutImpact          Associates an impact with a mutation 

Datatype Property      Domain             Range              Description 
physicalQuantity       UnitOfMeasurement  value              Identifies the magnitude of a mutation effect on the protein property 

Mutation series 

In the simplest case, impacts are results of single muta- 
tions. However, impacts are occasionally caused by muta- 
tion series. For example, in our corpus of 40 full-text 
documents, 6% of the mutation mentions are mutation 
series. 

During the manual inspection of several hundred 
documents containing mutation series, it became 
obvious that these mutation series (complex mutation 
expressions) may have different appearances and representations. 
They can be described using special symbols, 
such as "/", or keywords, such as double 
mutants, triple mutants, etc. Table 3 summarizes these 
different forms. 

Therefore, mutations connected with these special 
characters or preceded by the keywords are considered 
as mutation series and detected through regular expres- 
sions. To ensure that these detected mutation series 
have one identical internal representation, we further 
normalize them to the format, where all the mutations 
in a series are separated by the notation "/". 
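
The symbol-based part of this normalization can be sketched as follows. This is a simplified illustration under stated assumptions: it handles only single-letter mutations joined by the connectors "/", ":", "-" and "+", it does not cover the keyword-based forms (double mutant, triple mutant, etc.), and the names `SERIES_RE` and `normalize_series` are hypothetical.

```python
import re

# A point mutation in single-letter format, e.g. Q15K.
POINT = r"[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]"

# Two or more point mutations joined by one of the connector symbols.
SERIES_RE = re.compile(rf"\b{POINT}(?:\s*[/:+-]\s*{POINT})+\b")


def normalize_series(text):
    """Return each detected mutation series in the '/'-separated format."""
    series = []
    for m in SERIES_RE.finditer(text):
        members = re.findall(POINT, m.group(0))
        series.append("/".join(members))
    return series
```

Applied to "Mutant Q15K-W37R and the D179W+R258E+R272D variant", the sketch returns `["Q15K/W37R", "D179W/R258E/R272D"]`; isolated single mutations are left to the point mutation detector.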

Protein properties extraction component 

Mutations can alter the structure of proteins, which subsequently 
affects their functions, either by gaining a function 
or losing one. Mutations may also affect the 
stability of proteins, where the ratio of the unfolded pro- 
tein increases or decreases compared to the folded protein. 
Mutagenesis experiments are constantly performed to 
identify the importance of protein residues, either to find 
the source of a disease or a cure to one. Furthermore, stu- 
dies are done to improve enzyme functions. 

In our system, protein properties are expressed in 
RDF format [22] and detected through gazetteering. 

The extracted information can then be correlated to 
impacts in subsequent processing steps. 
Molecular function 

Understanding the role of mutations, in particular their 
contribution to diseases like cancer, requires identifying 
their impact on molecular functions. Causative mutations 
can drive cancers by activating or inactivating a protein 
function. They can promote cancer progression 
by their resistance to drugs or, according to a recent 
study, switching of functions [23]. 

Detection of the functional impact of mutations has 
not only drawn attention in cancer study, but has also 
been an important matter in re-sequencing efforts. 

To detect molecular functions, we use the concepts presented 
by the Gene Ontology. We generate an RDF representation 
of molecular functions from a download of the 
Gene Ontology. The Gene Ontology is provided in OBO-XML 
format, where each node is one entry (Figure 3). We 
first check for the molecular_function namespace, then 
extract the name and GO ID, as well as the synonyms of 
the entry. Using this information, we generate our RDF 
file. For obtaining further information, molecular functions 
are specified by their Gene Ontology ID (Figure 4). The 
format of a triple is C1 rdfs:subClassOf C2, where 
rdfs:subClassOf is an instance of rdf:Property 
and states that C1, here recognized as the Gene Ontology 
ID, is an instance of rdfs:Class and a subclass of C2, 
an instance of rdfs:Class, "molecular function". The 
resulting RDF is then used for gazetteering using an LKB 
gazetteer component [24]. 
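
The mapping from an OBO-XML entry to triples can be sketched with Python's standard XML parser. The code below is an illustration only: the embedded `TERM_XML` is a trimmed copy of the entry in Figure 3, triples are represented as plain tuples rather than serialized RDF, and the object name "MolecularFunction" and the function `term_to_triples` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the OBO-XML entry shown in Figure 3.
TERM_XML = """
<term>
  <id>GO:0000014</id>
  <name>single-stranded DNA specific endodeoxyribonuclease activity</name>
  <namespace>molecular_function</namespace>
  <synonym scope="exact">
    <synonym_text>ssDNA-specific endodeoxyribonuclease activity</synonym_text>
  </synonym>
</term>
"""


def term_to_triples(term):
    """Return (subject, predicate, object) tuples for a molecular function term."""
    # Skip entries from other namespaces (biological process, cellular component).
    if term.findtext("namespace", "").strip() != "molecular_function":
        return []
    go_id = term.findtext("id").strip()
    # The name and every synonym become labels of the same concept.
    labels = [term.findtext("name").strip()]
    labels += [s.text.strip() for s in term.iter("synonym_text") if s.text]
    triples = [(go_id, "rdfs:subClassOf", "MolecularFunction")]
    triples += [(go_id, "rdfs:label", label) for label in labels]
    return triples


triples = term_to_triples(ET.fromstring(TERM_XML))
```

Each label produced this way is a candidate gazetteer entry that resolves back to the same Gene Ontology ID.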
Kinetic constants 

Depending on their interests, enzyme and protein engi- 
neers apply recombinant DNA technology to improve 
enzyme kinetic values and stability or identify the roles of 
residues. Consider a study on the role of Asn107 in 
humans [25]: "To examine the role of Asn107 in the catalytic 
mechanism of human XR, mutant forms (N107D and 
N107L) were prepared. The two mutations increased Km 
for the substrate (>26-fold) and Kd for NADPH (95-fold), 
but only the N107L mutation significantly decreased kcat 
value." 

Here, two prepared mutations, N107D and N107L, affect 
three kinetic values, Michaelis Menten constant (Km), 
Turn-over number (Kcat) and Dissociation constant (Kd), 
of the protein. To capture these kinetic properties, we 
manually compiled them from the scientific literature. The 
list of these properties is by no means exhaustive. How- 
ever, property synonyms add complexity to later tasks 



Table 3 Mutation series examples 

Notation    Mutation Series 
:           The double mutant Thr48Ser:Trp93Ala and the triple mutant Thr48Ser:Trp57Met:Trp93Ala ... 
/           For G223D, H225N, G223D/T224I, T224I/H225N, and E156/173D mutants, ... 
-           Kinetic studies of the mutant A14S-Q15K-D36S-W37R indicated that the apparent Km ... 
double      a double mutant (N190V and W191S) and triple mutant (Q137M, L143F and H146L) resulted ... 
triple      the triple mutant, R501A,R451A,K439A, which eliminates all of ... 
quadruple   E130D/S325T/S477G/Q481K quadruple mutations in wild-type E. coli XL1-Blue. 
quintuple   ... The quintuple mutant V26T R47F A74G F87V L188K of P450BM-3 (P450BM-3 QM) converts ... 
+           The reaction of the D179W+R258E+R272D variant of CiP with ... 

<term> 
  <id>GO:0000014</id> 
  <name>single-stranded DNA specific endodeoxyribonuclease activity</name> 
  <namespace>molecular_function</namespace> 
  <def> 
    <defstr>Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule</defstr> 
    <dbxref> 
      <acc>mah</acc> 
      <dbname>GOC</dbname> 
    </dbxref> 
  </def> 
  <synonym scope="exact"> 
    <synonym_text>ssDNA-specific endodeoxyribonuclease activity</synonym_text> 
    <dbxref> 
      <acc>mah</acc> 
      <dbname>GOC</dbname> 
    </dbxref> 
  </synonym> 
  <is_a>GO:0004520</is_a> 
</term> 

Figure 3 Molecular function in OBO-XML. The example shows the molecular function single-stranded DNA specific endodeoxyribonuclease 
activity (GO ID: 0000014), encoded in OBO-XML. 



where the relations are extracted and validated against the 
ontology. A simple RDF schema allows us to deal with dif- 
ferent term representations of a concept and to resolve all 
aliases of the same concept. A triple is defined as C1 
rdfs:subClassOf C2, where C2 is an instance of 
rdfs:Class, "ProteinProperty". The rdfs:label is an 
instance of rdf:Property, where rdfs:domain is 
rdfs:Resource and the rdfs:range is rdfs:Literal. 

Normalizing all aliases to one single representation 
can also be helpful when populating the output ontology. 
Consider Half-life as an example: it can appear in 
different variations, e.g., t0.5, t1/2 and half-lives. All 
these variations are represented as labels in the aforementioned 
RDF; thus, if any of them matches, the 
mention is normalized to Half-life. 
Kinetic values 

Knowing the magnitude of protein properties affected by 
mutations enables biologists to better compare the muta- 
tion impacts of their interests. As an example, consider 
this bio-engineering study that was conducted on quinoprotein 
Glucose Dehydrogenase to improve the thermal 
stability of the enzyme [26]: "The half-life at 55°C of 
Ser415Cys (183 min) was approx 36-fold greater than that 
of the wild-type enzyme (5 min) and 4-fold greater than 
that of the Ser231Lys variant (40 min)." 



The Ser residue at position 415 is chosen for construct- 
ing different variants of the enzyme and compared with 
the S231K variant. Analyzing which variant results in the 
most thermostable enzyme requires the extraction of the 
magnitudes. The half-life of S415C is measured as 183 min, 
whereas that of S231K was measured as 40 min, and the measured 
half-lives of the two mutations are also compared to 
that of the wild-type enzyme. 

The magnitudes of protein properties are expressed in 
signed numbers, decimals and ranges of values for a sin- 
gle parameter. 

Since the existing GATE generic tokeniser [24,27] can 
only detect digits, we developed a simple tokeniser to cap- 
ture possible representations of magnitudes. To ensure that 
we extract the reported ranges of magnitudes, we collected 
possible range representations from the literature and 
expressed them through grammar rules. After detecting all 
possible values, we check which values express a physical 
quantity using the patterns and discard all other values. 
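
A value tokeniser of this kind can be approximated with one regular expression that covers signed numbers, decimals and ranges. The sketch below is not the tokeniser developed for the system: the pattern, the three range separators (hyphen, en dash, the word "to") and the function name `find_values` are assumptions chosen for illustration.

```python
import re

# A signed integer or decimal, e.g. 5, -3, 0.5, +12.25.
NUMBER = r"[+-]?\d+(?:\.\d+)?"

# A value, optionally followed by a range separator and an upper bound.
# The lookbehind keeps digits inside tokens such as S415C from matching.
VALUE_RE = re.compile(
    rf"(?<![\w.])(?P<low>{NUMBER})(?:\s*(?:-|\u2013|to)\s*(?P<high>{NUMBER}))?"
)


def find_values(text):
    """Return (low, high) pairs; high is None for a single value."""
    values = []
    for m in VALUE_RE.finditer(text):
        low = float(m.group("low"))
        high = float(m.group("high")) if m.group("high") else None
        values.append((low, high))
    return values
```

In this sketch, "70.4-71.1" is read as one range, while the digits in a mutation mention such as S415C are skipped. Deciding which of the returned values actually express a physical quantity is a separate step.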
Units of measurement 

Units of measurement are expressed in various formats, 
in mass or molar concentration (e.g., mg/ml or mmol/l), 
in different systems (e.g., unit, katal) and different scales 
(e.g., mM, and nM). Finding how a magnitude is 
measured requires detecting units of measurement. 



<rdfs:label>single-stranded DNA specific endodeoxyribonuclease activity</rdfs:label> 
<rdfs:subClassOf rdf:resource="http://www.semanticsoftware.info/molecular_function#MolecularFunction"/> 
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/> 
</rdf:Description> 

Figure 4 Molecular function in RDF. For processing in OMM, the concepts are mapped from OBO-XML to RDF, here shown for the example of 
a molecular function, single-stranded DNA specific endodeoxyribonuclease activity (GO ID: 0000014). 
Using the same approach as for kinetic properties, the 
list of units of measurement was collected from the literature 
and encoded in an RDF schema. The RDF schema is 
limited to one subclass hierarchy and assigns the units of 
measurement to their identified concept in the OWL-DL 
ontology. Consider the unit of measurement per second: 
the same concept is encoded as PerSecond in the OWL-DL 
ontology (see Impact Ontology Section). If any of the 
representations of per second is detected in the document, 
the class PerSecond is assigned to it, facilitating the ontology 
population step. 
Physical quantities 

We use the information about the units of measurement 
to extract physical quantities. Usually, units of measure- 
ment follow values, except for a few with no specific 
units of measurement, such as pH. More succinctly: 

physical quantity = value + unit of measurement 

After reviewing the literature, we designed a set of 
patterns to capture physical quantities. 
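
The relation "physical quantity = value + unit of measurement" can be illustrated with a single pattern that requires a known unit directly after a value. This is a sketch under stated assumptions: the short `UNITS` list is illustrative (the system reads its units from the RDF schema), unit-less quantities such as pH are not handled, and `find_physical_quantities` is a hypothetical name.

```python
import re

# Illustrative unit list; the described system loads units from an RDF schema.
UNITS = ["min", "mM", "nM", "mg/ml", "fold", "%"]

# Longest units first so that longer alternatives are tried before shorter ones.
_UNIT_ALT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))

# A value, optional whitespace or hyphen, then a known unit.
QUANTITY_RE = re.compile(
    rf"(?<![\w.])(?P<value>[+-]?\d+(?:\.\d+)?)\s*-?\s*(?P<unit>{_UNIT_ALT})(?!\w)"
)


def find_physical_quantities(text):
    """Return (value, unit) pairs for every value followed by a known unit."""
    return [
        (float(m.group("value")), m.group("unit"))
        for m in QUANTITY_RE.finditer(text)
    ]
```

On the half-life sentence quoted earlier, the sketch picks up 183 min, 36-fold and 5 min, and ignores both the temperature (its unit is not in the illustrative list) and the digits inside Ser415Cys.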

Impact extraction component 

Mutations are considered as sources of species evolution. 
Some result in beneficial changes, while others have detri- 
mental effects. It is important to not only find impacts, 
but also to mark the origin mutation and altered protein 
properties for further analysis. A system capable of analyz- 
ing mutation impacts requires information from many 
entities. Impact analysis consists of the following steps: 

1. Finding impact expressions. 

2. Finding mutations or mutation keywords. 

3. Identifying the polarity of the impact to detect 
advantageous and disadvantageous impacts. 

4. Grounding the impacts to mutations to find which 
mutations lead to a specific impact. 

5. Finding the affected protein properties. 

6. Finding the magnitude of the effect to help bio- 
engineers compare the effects and find the most favour- 
able mutations. 

For the first step, we use ontology based gazetteering, 
with the help of the morphological analyzer [24], to cap- 
ture term variations. Using some heuristics (see Grounding 
section), we attempt to ground the impacts to the detected 
mutations. Possible kinetic values are found using a cus- 
tom tokeniser and validated by some rules. The magnitude 
of an impact is detected through heuristics and validated 
against the domain ontology. The last task solved by the 
system is to find the protein properties changed by a 
mutation, which is also done through additional heuristics. 
Impact gazetteer list generation 

To identify the polarity of the impacts, we use the devel- 
oped OWL ontology encoding the type information of 
the impacts. Using an onto-gazetteer NLP component 



[24], the text matches the gazetteer list entries, and the 
impact type class in the ontology is assigned to the text. 
The impact gazetteer lists for positive, negative, neutral 
and non-measurable impacts, consisting of 130 words, 
were also compiled from the literature. 

Furthermore, the impact terms appear in different 
forms. For example, activates, activate, activated, activating 
are all potential impact words. The problem of 
term variation can be alleviated by stemming. All the 
aforementioned variations of activate have the same root: 
"activate". The morphological analyzer [24] provides the 
root of the impact words, and by matching the stemming 
result against the prepared impact gazetteer lists, all the 
various representations can be detected. In the above 
example, by adding activate to our list, we can detect 
activates, activate, activated, activating. 
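
The root-based lookup can be sketched as follows. The suffix stripping here is a deliberately crude stand-in for the morphological analyzer, and the four gazetteer entries with their polarities are assumptions for illustration; neither reflects the actual 130-word lists.

```python
# Illustrative gazetteer: root form -> impact polarity (entries are assumptions).
IMPACT_GAZETTEER = {
    "activate": "positive",
    "increase": "positive",
    "decrease": "negative",
    "reduce": "negative",
}


def candidate_roots(word):
    """Yield crude root candidates by stripping common inflectional suffixes."""
    w = word.lower()
    yield w
    for suffix in ("ing", "ed", "es", "s", "d"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            stem = w[: -len(suffix)]
            yield stem
            yield stem + "e"  # restores a dropped final "e", e.g. "activat"


def impact_polarity(word):
    """Return (root, polarity) if any root candidate is in the gazetteer."""
    for root in candidate_roots(word):
        if root in IMPACT_GAZETTEER:
            return root, IMPACT_GAZETTEER[root]
    return None
```

With only the root "activate" in the list, all four surface forms from the example resolve to the same entry and polarity.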
Impact detection 

Now that we have gathered all the impact expressions, we 
use this information to mark the impacts. The 
scope of the impact should be limited to the part of a 
sentence expressing the impact. Consider the following 
example [28]: "The effects of the S136A and Y149F 
mutations on the Km values for NADP(H) were low, but 
the K153M mutation caused increases of more than 53-fold 
in the values, which suggests that Lys153 is involved 
in the coenzyme binding." 
Three impacts are expressed in the above example: 
• The effect of S136A on the Km values for NADP(H) 
was low 

• The effect of Y149F on the Km values for NADP(H) 
was low 

• K153M caused increases of more than 53-fold in the 
values 

However, this representation cannot provide users with 
thorough information, in particular when a comparison 
between multiple impacts is made. Hence, we expand the 
scope of the impact to each sentence. In case impact 
words are detected in a sentence, the sentence is marked 
as an impact sentence. When multiple impact words are 
detected in a sentence, the sentence is marked multiple 
times as an impact sentence. 

Relying on impact word expressions alone to detect 
impact sentences would lead to many false positives. As 
the next example illustrates, the impact expression 
'reduced' exists in the sentence, however, the sentence 
does not express an impact of a mutation [29]: "The lim- 
ited degree of flexibility in thermophilic enzymes results in 
reduced catalytic efficiency when compared to that of their 
mesophilic counterpart at low temperatures." 

On the other hand, extracting only the sentences con- 
taining mutation mentions and impact word expressions 
results in many false negatives [30]: "Indeed, the N249Y 
substitution increases by six-fold the turnover number 
measured at 65°C with benzyl alcohol as substrate. 
Furthermore, the affinity for coenzymes is substantially 
lower than that of the wt protein (Michaelis constant KM 
for p, 25-fold greater)." 

In the above example, the first impact, the increase of 
the turnover number, can be grounded to the mutation 
N249Y. However, the second impact, on the affinity for 
coenzymes, is only linked to the mutation through the 
context. If we only extracted the impact sentences 
containing mutations, we would miss the second impact. 
Therefore, to capture impact sentences effectively, we 
extract all the sentences containing impact word 
expressions, and then discard those that contain neither 
a mutation nor any special vocabulary describing 
a change to a protein. 
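The filtering step described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the word lists IMPACT_WORDS and CHANGE_WORDS, the regular expression MUTATION_RE and the function names are assumptions made for this example, whereas the real system relies on its own gazetteer lists and grammar rules.

```python
import re

# Hypothetical vocabularies for illustration only; the system's real
# gazetteer lists are not reproduced here.
IMPACT_WORDS = {"increase", "increases", "increased", "decrease",
                "decreases", "decreased", "reduced", "lower", "loss"}
CHANGE_WORDS = {"mutant", "mutation", "substitution", "variant",
                "wild-type", "wt"}

# Single-point mutations in the one-letter format, e.g. N249Y.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")


def words(sentence):
    """Lower-cased word tokens; hyphenated words stay in one piece."""
    return set(re.findall(r"[a-z][a-z-]*", sentence.lower()))


def impact_sentences(sentences):
    """Keep sentences that contain an impact word and, in addition,
    either a mutation mention or a word describing a protein change."""
    kept = []
    for sentence in sentences:
        tokens = words(sentence)
        if not tokens & IMPACT_WORDS:
            continue  # no impact expression at all
        if MUTATION_RE.search(sentence) or tokens & CHANGE_WORDS:
            kept.append(sentence)
    return kept
```

Applied to the three example sentences quoted above, this sketch drops the sentence about thermophilic enzymes (an impact word but no mutation or change vocabulary) and keeps both the N249Y sentence and the follow-up sentence that mentions only the "wt protein".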

The impact expressions existing in one noun phrase 
are considered as one expression. 

Furthermore, if an impact expression appears in a verb 
phrase followed by another impact expression in a noun 
phrase, we consider them as one impact expression. 
Impact grounding 

As discussed earlier, bio-engineers are interested in 
knowing what kind of effects an engineered mutation 
can lead to. For this reason, the system must be able to 
accurately determine which mutation introduces a speci- 
fic impact. This is accomplished by a number of 
heuristics. 



Once entities such as mutations, mutation series 
and impact words are identified and annotated, impact 
expressions are associated with mutations. The algorithm 
for semantic assignment (Figure 5) is as follows: 
we first check whether a given sentence contains any 
impact expressions; if so, all the mutations in the sentence 
are collected and analysed according to the following 
cases. 

Case 1: If the impact sentence contains one mutation, 
then all the impact expressions in the sentence are 
grounded to that detected mutation (the complete sen- 
tence is considered as an impact sentence). The detected 
mutation can be a single mutation or a mutation series. 

Case 2: If there exists more than one mutation: 

1. We check if the mutations are connected with con- 
junctions such as and and or; if yes, the impact is 
grounded to every detected mutation (the complete sen- 
tence is annotated multiple times, each time with one of 
the detected mutations). Some of these mutations can 
be mutation series, such as N190V/W191S. 

2. If the mutations or mutation series are not connected with 
conjunctions such as and and or, the impact 
is grounded to the nearest detected mutation or mutations 
(if the nearest mutations are connected by conjunctions, 
the impact is grounded to each of them). 



Case 3: If no mutations are found in the sentence, the 
impact is grounded to the nearest mutation or mutations, 
making the simple assumption that the nearest mutation 
or mutation series invokes the impacts mentioned. 

Figure 5 Impact grounding heuristics Each sentence is analysed for the occurrence of mutation and impact entities. Depending on the 
number of entities and the syntactic structure of the sentence, impacts are connected with mutations as shown in the figure. 
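The three grounding cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions, the Mention class and the function names are hypothetical, distance is measured in characters, and for Case 3 the caller is assumed to pass in the nearest mutation(s) found outside the sentence (context_mutations). The actual system works on GATE annotations rather than on raw strings.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    start: int
    end: int


AA = "ACDEFGHIKLMNPQRSTVWY"
# A point mutation such as N249Y, or a series such as N190V/W191S.
MUTATION_RE = re.compile(rf"\b[{AA}]\d+[{AA}](?:/[{AA}]\d+[{AA}])*\b")
# Hypothetical impact vocabulary, for illustration only.
IMPACT_RE = re.compile(r"\b(?:increase[sd]?|decrease[sd]?|reduced|loss|low|lower)\b")
# Text allowed between two mutations that counts as "connected".
CONNECTOR_RE = re.compile(r"^\s*(?:,|,?\s*(?:and|or))\s*$")


def mentions(pattern, sentence):
    return [Mention(m.group(), m.start(), m.end())
            for m in pattern.finditer(sentence)]


def conjunction_groups(sentence, mutations):
    """Split the mutations into runs joined by commas, 'and' or 'or'."""
    groups = []
    for mut in sorted(mutations, key=lambda m: m.start):
        if groups and CONNECTOR_RE.match(sentence[groups[-1][-1].end:mut.start]):
            groups[-1].append(mut)
        else:
            groups.append([mut])
    return groups


def gap(impact, mutation):
    """Number of characters separating two mentions (0 if they touch)."""
    return max(mutation.start - impact.end, impact.start - mutation.end, 0)


def ground(sentence, context_mutations=()):
    """Return a list of (impact expression, [grounded mutations])."""
    impacts = mentions(IMPACT_RE, sentence)
    mutations = mentions(MUTATION_RE, sentence)
    if not impacts:
        return []
    if not mutations:
        # Case 3: fall back to the nearest mutation(s) outside the sentence.
        return [(imp.text, list(context_mutations)) for imp in impacts]
    groups = conjunction_groups(sentence, mutations)
    if len(groups) == 1:
        # Case 1 (one mutation) and Case 2.1 (all mutations connected).
        targets = [m.text for m in groups[0]]
        return [(imp.text, targets) for imp in impacts]
    # Case 2.2: ground each impact to the nearest mutation or group.
    grounded = []
    for imp in impacts:
        nearest = min(groups, key=lambda g: min(gap(imp, m) for m in g))
        grounded.append((imp.text, [m.text for m in nearest]))
    return grounded
```

Treating a single mutation and a fully connected list as the same branch keeps the sketch short; a purely character-based notion of "nearest" is a simplification and will misground impacts in sentences where the closest mutation is not the responsible one.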

ImpactOn relation detection 

To help bio-engineers find their favourable mutation, we 
need to determine which protein properties are altered. 
Consider the following example [28]: "In this study, we 
have confirmed the roles of Ser136, Tyr149 and Lys153 of 
XR as the catalytic triad by drastic loss of activity resulting 
from the mutagenesis of S136A, Y149F and K153M in 
rat XR." 

Two prerequisite pieces of information are detected: the 
impact expression, loss, and the protein property changed by 
the mutation, activity. Now we attempt to associate 
the appropriate pair. We use a simple heuristic to 
detect which protein property is affected by a mutation: 
1. We first check whether exactly one impact exists in a 
given sentence; if so, the sentence is searched for protein 
properties. We assume that all detected protein properties 
are altered by the impact. 

2. If multiple impacts are detected in a sentence, each 
impact is linked to the nearest protein property. 

The impactOn relation is represented with the sen- 
tence containing the impact expression and the protein 
property. The result annotation of the above example is 
shown in Table 4. 
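The two-step impactOn heuristic can be illustrated with a short sketch. The tuple layout (text, start, end), the helper span and the function link_impact_on are assumptions introduced for this example, and distance is again measured in characters.

```python
def span(sentence, phrase):
    """Locate the first occurrence of phrase; returns (text, start, end)."""
    start = sentence.index(phrase)
    return (phrase, start, start + len(phrase))


def gap(a, b):
    """Number of characters separating two spans (0 if they overlap)."""
    return max(b[1] - a[2], a[1] - b[2], 0)


def link_impact_on(impacts, properties):
    """Pair impact expressions with protein properties.

    One impact: it is assumed to alter every detected property.
    Several impacts: each is linked to its nearest property.
    """
    if not impacts or not properties:
        return []
    if len(impacts) == 1:
        return [(impacts[0][0], prop[0]) for prop in properties]
    return [(imp[0], min(properties, key=lambda prop: gap(imp, prop))[0])
            for imp in impacts]


sentence = ("We have confirmed the roles of Ser136, Tyr149 and Lys153 of XR "
            "as the catalytic triad by drastic loss of activity.")
print(link_impact_on([span(sentence, "loss")], [span(sentence, "activity")]))
```

For the example sentence, the single impact expression loss is paired with the single protein property activity, which corresponds to the annotation in Table 4.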

MeasuredWith relation detection 

At this stage, we find relations between the protein 
property affected by an impact and units of measurement 
and effect magnitudes (numerical values). Consider 
the following two examples: (1) [29] "The mutant 
SsADH displays improved thermal stability, as indicated 
by the increase in Tm from 90 to 93°C, which was determined 
by the apparent transition curves." (2) [31] 
"Except for Thr416Val/Thr417Val, which had a Km 
value of 16 mM, the mutants had Km values identical 
to 20 mM Km value of the wild-type enzyme." 

In the first example, to know how much the thermal stability 
was improved by the mutation, we need to link the 
extracted protein property, thermal stability, with the physical 
quantity, 90 to 93°C. In the second example, we need 
to detect that the Km property of the protein affected by 
the double mutant, T416V/T417V, is measured as 16 mM, 
while other mutants had the same Km value as that of the 
wild-type enzyme, which is measured as 20 mM. To fulfil 
our objective of relating the protein properties with their 
units of measurement, we use simple proximity heuristics. 
The detected relation candidates are then validated against 
the domain ontology (see Impact Ontology section): if the 
detected physical quantity is not among the possible value 
fillers of the slots for the aforementioned protein property, 
the relation candidate is discarded. Consider the following 
example [32]: "In addition, the half lives at 60°C of the 
R156E and N173D xylanases were respectively 6 and 40 
min longer than that of the wild-type enzyme even in the 
absence of substrate." 

The protein property half lives is measured with minute, 
hour and fold. In the above example, the closest 
physical quantity is 60°C; when this candidate relation is 
validated against the ontology, it is discarded, because Degree 
Celsius is not one of the fillers of half life. The correct 
filler, 6 and 40 min, is assigned instead. 

To represent the relation, we mark the sentence as a 
measuredWith relation with two features, property name 
and physical quantity. 
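The combination of proximity and ontology validation can be sketched as follows. The dictionary ALLOWED_UNITS stands in for the value fillers defined in the domain ontology and contains only the half-life fillers named above plus one assumed entry for thermal stability; the helper names and the tuple layout are likewise hypothetical.

```python
# Stand-in for the ontology's permitted value fillers (assumption: only
# the half-life entry is taken from the text; the second is illustrative).
ALLOWED_UNITS = {
    "half life": {"min", "h", "fold"},
    "thermal stability": {"°C"},
}


def locate(sentence, text):
    """Return (start, end) of the first occurrence of text."""
    start = sentence.index(text)
    return start, start + len(text)


def quantity(sentence, text, unit):
    """Build a quantity candidate: (text, unit, start, end)."""
    start, end = locate(sentence, text)
    return (text, unit, start, end)


def measured_with(property_name, property_span, quantities):
    """Return the nearest quantity whose unit the ontology permits
    for this property, or None if every candidate is rejected."""
    p_start, p_end = property_span

    def gap(q):
        return max(q[2] - p_end, p_start - q[3], 0)

    permitted = ALLOWED_UNITS.get(property_name, set())
    for text, unit, _start, _end in sorted(quantities, key=gap):
        if unit in permitted:
            return text
    return None


sentence = ("In addition, the half lives at 60°C of the R156E and N173D "
            "xylanases were respectively 6 and 40 min longer than that of "
            "the wild-type enzyme.")
candidates = [quantity(sentence, "60°C", "°C"),
              quantity(sentence, "6 and 40 min", "min")]
print(measured_with("half life", locate(sentence, "half lives"), candidates))
```

In this example the nearest candidate, 60°C, is rejected because its unit is not a permitted filler for half life, so the search moves on to the next-nearest quantity, 6 and 40 min.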

Ontology population 

To provide protein engineers and scientists with com- 
prehensive information and a more expressive model, 
we populate our domain ontology with the extracted 
information, which can then be queried as a knowledge 
base. Since manually populating the ontology is a cum- 
bersome task, we integrated the OwlExporter [33,34] 
component to automate this task. 

Two ontologies are required to export our extracted 
information: our domain Impact ontology and an NLP 
ontology provided with the OwlExporter component 
[34]. The NLP ontology contains concepts such as 
Document and Sentence. 

While populating our domain ontology, the OwlExporter 
automatically populates the NLP ontology. Individuals 
of our domain concepts, such as Mutation, 
MutationImpact and ProteinProperty, are associated 
with the individuals of the NLP concepts, such as 
Sentence. We can then invoke more advanced 
queries, e.g., finding all the extracted impacts in a specific 
sentence. 

In order to be able to export the entities and the relationships 
to our domain ontology, we need to assign 
OWLExportClass and OWLExportRelation annotation 
types to the document annotations. This is 
achieved with additional JAPE grammar rules. By assigning 
these two types of annotations to our document 
annotations, we inform the OwlExporter about the 
annotations we want to export. 

Table 4 Result annotation example 

impactOn            ... of Ser136, Tyr149 and Lys153 of XR as the catalytic triad by drastic loss of activity ... 
Protein property    activity 
Impact Expression   loss 

The ontology property 'impactOn' connects impact expressions with the affected protein property. 

Naderi and Witte BMC Genomics 2012, 13(Suppl 4):S10 
http://www.biomedcentral.com/1471-2164/13/S4/S10 

Figure 6 shows an example from the populated ontol- 
ogy with sentence and impact instances. 

Application 

Our system is implemented based on the General Archi- 
tecture for Text Engineering (GATE) [24], a Java-based 
open source component framework for text processing. 
Our system can be run stand-alone, embedded in other 
applications, or deployed on a cloud computing infra- 
structure for large-scale document processing using the 
GATE Cloud Parallelizer (GCP). Additionally, we pro- 
vide a number of semantic access methods, described 
below. 

Web Service invocation 

To use our pipelines as a web service, we created OWL 
service descriptions for the Semantic Assistants frame- 
work [35]. Two services are currently provided, one for 
mutation tagging and one for impact detection. These 
services are described through metadata expressed in an 
OWL ontology. Both services can then be deployed in a 
Semantic Assistants server. The server allows any web 
client to send documents to the service through stan- 
dard web service invocations and receive the results in 
XML format. Additionally, Semantic Assistants-enabled 
clients, like OpenOffice or the Firefox web browser 
(Figure 7), can directly send documents to the services 
on behalf of a user. 
Querying impact information 

Presenting impact information in a structured format 
allows users to quickly access the relevant information 
[36]. For example, an end user might be interested in 
searching for the impacts of a specific mutation, or for all the 
properties altered by an impact. To this end, we 
export the extracted information to the ontology; consequently, 
we can simply query the ontological knowledge 
base for the desired information. The SPARQL query in 
Figure 8 retrieves the mutations that increased the activity of a 
protein; the results 
of this query are shown in Figure 9. 

Results 

We analyzed the performance of our approaches for 
mutation series and impact extraction in detail on different 
corpora. First, the mutation series detection module 
is evaluated. Then, the effectiveness 
of the impact extraction, as well as grounding 
to the correct mutation, is measured on literature 
describing enzymes. 

Figure 6 An example of the populated impact ontology OMM can export results into the ontology shown in Figure 2 through so-called 
ontology population. Each occurrence of a mutation, impact, etc. is connected with the corresponding domain concepts and additionally linked 
to its source sentence and document. 

Figure 7 OMM web browser integration Impact extraction can be dynamically requested from the Firefox browser, which is then executed 
through the OMM Semantic Assistants web service. Results are mapped as annotations onto the viewed document, which facilitates scientific 
literature analysis. 

Data 

To evaluate the performance of the system for each task, 
we prepared two corpora: Mutation Series and Impact. 
Mutation series detection corpus 

We prepared a corpus containing 11 full-text PubMed 
articles on enzymes to assess the efficiency of the system 
in detecting mutation series. We ensured that all these 
documents contain multiple mutation mentions. These 
documents contain a total of 1306 mutations and 271 
mutation series. The list of documents used for evaluation 
is provided in an additional file [see Additional file 
1]. 

Impact extraction corpora 

We selected 40 PubMed IDs and manually annotated 
them with the impact information. For each impact 
mention, only the part of the sentence mentioning the 
mutation and the impact was selected. Thus, if a sen- 
tence expresses multiple impacts, all are annotated sepa- 
rately [see Additional file 2 for manual annotations]. 



PREFIX onto: <http://www.owl-ontologies.com/unnamed.owl#> 
SELECT ?Mutation ?Sentence 
FROM <http://www.owl-ontologies.com/unnamed.owl#> 
WHERE { ?Mutation onto:mutationMutImpactRel ?MutationImpact . 
        ?MutationImpact onto:appearsIn ?Sentence . 
        ?MutationImpact onto:impactOn ?ProteinProperty . 
        FILTER regex(str(?ProteinProperty), "activity") 
        FILTER regex(str(?MutationImpact), "increase") } 

Figure 8 SPARQL query example Text mining results can be obtained by querying the populated ontology. The example shown here is the 
SPARQL translation of the question, find all mutations that increased the activity of a protein. 

Figure 9 Query Result Example output from the query shown in Figure 8, listing all mutations that increased the protein property "activity". 



The impacts are grounded to the respective mutations 
and the EC number of experimented enzymes is speci- 
fied. The list of documents used for evaluation is pro- 
vided in an additional file [see Additional file 1]. 

Evaluation 

First, the correctness of the mutation series extraction is 
assessed. Second, the effectiveness of the impact extrac- 
tion, as well as grounding to the correct mutation, is 
measured on literature describing enzymes. Since the 
mutation series detection relies on correctly recognizing 
mutations, we first show the mutation detection result 
for each system, followed by the result of our mutation 
series detection. 
Quantitative evaluation metrics 

The evaluation procedure is performed by comparing 
the manually annotated texts with the annotations gen- 
erated by our system, measured with the metrics 
explained in this section. The number of correctly iden- 
tified items as a percentage of the number of items 
identified is specified as precision (P). The number of 
correctly identified items as a percentage of the total 
number of correct items is defined as recall (R). The 
F-measure (F) is the weighted harmonic mean of 
precision and recall. Finally, accuracy is the percentage 
of decisions that are correct. The performance results 
are computed according to different criteria: Strict (S) 
and Lenient (L). In "Strict", we measure all partially cor- 
rect responses as incorrect. In "Lenient", all partially 
correct responses are measured as correct [24]. 
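As a worked illustration of these metrics, the sketch below computes precision, recall and F-measure from the four counts reported per document (Correct, Partially Correct, Missing, Spurious). The function name and signature are assumptions for this example; the formulas follow the usual count-based definitions, with partially correct responses counted as incorrect in Strict mode and as correct in Lenient mode.

```python
def precision_recall_f(correct, partial, missing, spurious,
                       lenient=False, beta=1.0):
    """Precision, recall and F-measure from annotation counts.

    Strict: partially correct responses count as incorrect.
    Lenient: partially correct responses count as correct.
    beta weights recall against precision (beta=1 gives the balanced F).
    """
    matched = correct + (partial if lenient else 0)
    responses = correct + partial + spurious   # what the system returned
    expected = correct + partial + missing     # what the annotators marked
    precision = matched / responses if responses else 0.0
    recall = matched / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = ((1 + beta ** 2) * precision * recall
         / (beta ** 2 * precision + recall))
    return precision, recall, f


# Overall MutationFinder series counts from Table 6:
# 162 correct, 0 partially correct, 109 missing, 0 spurious.
p, r, f = precision_recall_f(162, 0, 109, 0)
print(f"P={p:.0%} R={r:.0%} F={f:.0%}")  # prints "P=100% R=60% F=75%"
```

With no partially correct responses, as in Table 6, the Strict and Lenient figures coincide.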
Mutation detection evaluation 

Both MutationFinder [6] and our OMM MutationTag- 
ger were applied to 11 manually annotated documents; 
the comparative results of the systems are shown in 
Table 5. 

Mutation series evaluation 

We also verified the correctness of the extracted muta- 
tion series. The results are presented in Table 6. 
Impact analysis evaluation 

Here, we analyse how correctly our system can detect all 
impacts expressed in a sentence. We further investigate 
the performance of our developed grounding algorithm 
(see Grounding Section). The performance of our system 
on our manually annotated corpus of 40 documents is 
assessed and the results are summarized in Tables 7 and 
8. OMM system results are provided in an additional 
file [see Additional file 3]. 



In our corpora, 5% of all point mutations are 
expressed in natural language; thus, in an experiment, 
we considered the results of both MutationTagger and 
MutationFinder for the impact detection and grounding 
tasks. As can be seen in Table 7, this combination of 
both systems slightly increases recall at the expense of 
precision. 

Discussion 

False negatives of impact detection are mainly due to 
author-defined mutation names. For example, PMID 
10074357, reporting on the mutant of alcohol dehydro- 
genase, uses mSsADH to refer to N249Y in the document. 
The authors of the paper PMID 10544015 also assign No. 87 
to a mutant containing 8 amino acid substitutions: 
T71A, K264E, L317S, T331A, R407L, S415G, K455I and 
E277G. Since we rely on mutation mentions and the keywords 
introduced earlier in the Impact Detection section to 
detect impacts, these impacts are not detected. 

Tables from processed PDF files are converted into 
indistinct textual blocks, and in case they are reporting 
the impacts of mutations, our system detects them as 
impacts. These mentions are not manually annotated, 
thus they are considered as false positives. 

Conclusions 

Mutation impacts are essential for understanding the 
role of mutations. The data regarding the mutations and 
impacts exists primarily in scientific publications. In this 
paper, we described Open Mutation Miner (OMM), a 
comprehensive, modular, open source text mining sys- 
tem for extracting and grounding mutation impacts, 
affected protein properties and magnitudes of effects. 

The performance of our system is evaluated on multi- 
ple corpora. Furthermore, we created additional manual 
annotations for the biomedical literature. Our ontology 
population approach provides comprehensive informa- 
tion to a biologist and can be queried or further inte- 
grated with other systems. 

Further work will address mutation co-reference resolution. 
In journal papers, authors very often use pronominal 
or nominal mutation references, which hinder 
the grounding of impacts. All occurrences of mutations, 
including nominal and pronominal references, 
need to be detected. Deletion and insertion mutations 
pose additional challenges to be addressed in a 
future version. 

Table 5 Mutation detection evaluation 
Comparative evaluation of the mutation detection performance, using MutationFinder and OMM MutationTagger. 










Table 6 Mutation series detection evaluation 

Mutation series detection (MutationFinder) 
                                                           Strict              Lenient 
Document (PMID)  Correct  Partially C.  Missing  Spurious  P     R     F       P     R     F 
10860737         13       0             0        0         100%  100%  100%    100%  100%  100% 
12604240         11       0             0        0         100%  100%  100%    100%  100%  100% 
12664592         4        0             0        0         100%  100%  100%    100%  100%  100% 
12702265         1        0             7        0         100%  12%   22%     100%  12%   22% 
12890481         26       0             0        0         100%  100%  100%    100%  100%  100% 
12902331         51       0             0        0         100%  100%  100%    100%  100%  100% 
15026177         13       0             0        0         100%  100%  100%    100%  100%  100% 
17761677         1        0             0        0         100%  100%  100%    100%  100%  100% 
19143837         40       0             0        0         100%  100%  100%    100%  100%  100% 
9731776          1        0             0        0         100%  100%  100%    100%  100%  100% 
14592457         1        0             102      0         100%  1%    2%      100%  1%    2% 
Average          162      0             109      0         100%  60%   75%     100%  60%   75% 

Mutation series detection (MutationTagger) 
                                                           Strict              Lenient 
Document (PMID)  Correct  Partially C.  Missing  Spurious  P     R     F       P     R     F 
10860737         13       0             0        0         100%  100%  100%    100%  100%  100% 
12604240         11       0             0        0         100%  100%  100%    100%  100%  100% 
12664592         4        0             0        0         100%  100%  100%    100%  100%  100% 
12702265         7        0             1        0         100%  88%   93%     100%  88%   93% 
12890481         26       0             0        0         100%  100%  100%    100%  100%  100% 
12902331         51       0             0        0         100%  100%  100%    100%  100%  100% 
15026177         13       0             0        0         100%  100%  100%    100%  100%  100% 
17761677         1        0             0        0         100%  100%  100%    100%  100%  100% 
19143837         40       0             0        0         100%  100%  100%    100%  100%  100% 
9731776          1        0             0        0         100%  100%  100%    100%  100%  100% 
14592457         102      0             1        0         100%  99%   100%    100%  99%   100% 
Average          269      0             2        0         100%  99%   100%    100%  99%   100% 




Table 7 Impact Detection Evaluation on 40 full-text Documents 

Impact Detection Evaluation 
                                   #Documents  Precision  Recall  F-Measure 
MutationTagger                     40          70.4%      71.3%   70.8% 
MutationFinder                     40          71.1%      71.4%   71.24% 
MutationTagger + MutationFinder    40          70.8%      71.5%   71.1% 



Table 8 Impact Grounding Evaluation on 40 manually annotated documents 

Impact Grounding Evaluation        Accuracy 
MutationTagger                     76.5% 
MutationFinder                     76.9% 
MutationTagger + MutationFinder    77% 



Abbreviations used 

AMENDA: Automatic Mining of ENzyme DAta; 
BRENDA: BRaunschweig ENzyme DAtabase; DL: Description 
Logic; EC: Enzyme Commission; FRENDA: Full Reference 
ENzyme DAta; GATE: General Architecture for Text 
Engineering; GO: Gene Ontology; JAPE: Java Annotation 
Pattern Language; KID: Kinetic Database; NLP: Natural 
Language Processing; OWL: Web Ontology Language; 
RDF: Resource Description Framework; SI: Système International 
d'Unités; SPARQL: SPARQL Protocol and RDF 
Query Language; XML: Extensible Markup Language. 

Additional material 



Additional file 1: List of documents used for evaluation The list of 
documents (PubMed IDs) used in the corpora for mutation series and 
impact analysis evaluation. 



Additional file 2: Manual annotations Manual annotations of 40 
documents used for evaluation, presented in tab-delimited format. 

Additional file 3: OMM system results The output of our OMM system, 
listing all detected impacts that are grounded to their respective 
mutations. 